he pattern discovery power
It is interesting to examine whether the word length (the k value of
the k-mer approach) matters. Based on the above two comparisons, the k
value was varied from two to six to examine whether the accuracy of
the alignment-free approach can be maintained. Figure 7.15 shows the result.
It can be seen that when the word length increases, the correlation between
the alignment distance and the k-mer distance decreases. This means that
it is better to use shorter words when applying the alignment-free approach to
sequence comparison so as to maintain the discovery power.
Figure 7.15 An examination of whether the k-mer word length matters for the
alignment-free approach to maintain the accuracy in multiple sequence comparisons.
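
The word-length experiment described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the figure: the function names (kmer_profile, correlation_by_k) are hypothetical, the k-mer distance is assumed to be the Euclidean distance between relative word frequencies, and a plain Levenshtein edit distance is used as a stand-in for the alignment distance, which in practice would come from a proper aligner.

```python
from itertools import combinations, product
from math import sqrt


def kmer_profile(seq, k):
    """Relative frequencies of all 4**k DNA words, in a fixed word order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows with ambiguous bases such as N
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in words]


def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def edit_distance(a, b):
    """Levenshtein distance; ASSUMPTION: a stand-in for an alignment distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_by_k(seqs, ks=range(2, 7)):
    """Correlation between reference and k-mer distances for each word length."""
    pairs = list(combinations(range(len(seqs)), 2))
    reference = [edit_distance(seqs[i], seqs[j]) for i, j in pairs]
    result = {}
    for k in ks:
        profiles = [kmer_profile(s, k) for s in seqs]
        kmer_dist = [euclidean(profiles[i], profiles[j]) for i, j in pairs]
        result[k] = pearson(reference, kmer_dist)
    return result
```

Calling correlation_by_k on a list of sequences returns one correlation coefficient per word length from two to six, which can then be plotted against k. The edit distance is quadratic in sequence length, so for whole genomes the reference distances would be taken from an alignment tool rather than computed this way.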
er machine
Whether a word matrix (a k-mer frequency matrix) generated by an
alignment-free approach can show a good discrimination power needs to
be examined. Therefore, ten SARS-CoV genome sequences and ten
SARS-CoV-2 genome sequences were downloaded from NCBI. A word
matrix (library) of the 3-mers for these genome sequences was derived.
Figure 7.16 shows a heatmap generated for this word matrix. It can be seen
that an unsupervised machine learning model can make a good
discrimination between the two types of SARS genomes using the 3-mer word
library.
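
To make the word-matrix idea concrete, the following sketch builds a 3-mer frequency matrix and splits its rows into two groups without using any labels. It rests on assumptions that are not in the text: the names word_matrix and two_clusters are hypothetical, and average-linkage hierarchical clustering on Euclidean distances is chosen only because it is the kind of grouping commonly drawn alongside a heatmap; the unsupervised model behind the figure may differ.

```python
from itertools import product
from math import dist


def word_matrix(seqs, k=3):
    """Rows are sequences, columns are the 4**k words, entries are relative frequencies."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {w: j for j, w in enumerate(words)}
    matrix = []
    for seq in seqs:
        row = [0] * len(words)
        for i in range(len(seq) - k + 1):
            j = index.get(seq[i:i + k])
            if j is not None:  # ignore words containing non-ACGT symbols
                row[j] += 1
        total = sum(row) or 1
        matrix.append([c / total for c in row])
    return words, matrix


def two_clusters(matrix):
    """Average-linkage agglomerative clustering, stopped at two clusters.

    ASSUMPTION: this is an illustrative choice of unsupervised method.
    Returns one label (0 or 1) per row; row 0 always receives label 0.
    """
    clusters = [[i] for i in range(len(matrix))]

    def linkage(a, b):
        # Mean pairwise Euclidean distance between two clusters of rows.
        return sum(dist(matrix[i], matrix[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 2:
        candidates = ((p, q) for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        p, q = min(candidates, key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] += clusters.pop(q)  # q > p, so index p is unaffected by the pop

    labels = [0] * len(matrix)
    for i in clusters[1]:
        labels[i] = 1
    return labels
```

With genome sequences read from FASTA files, word_matrix(seqs)[1] gives the matrix that a heatmap would display, and two_clusters shows whether the two genome types fall into separate groups. The pairwise linkage search is cubic in the number of sequences, which is acceptable for twenty genomes but not for large collections.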